51 research outputs found

    20 GB in 10 minutes: a case for linking major biodiversity databases using an open socio-technical infrastructure and a pragmatic, cross-institutional collaboration

    Biodiversity information is made available through numerous databases that each have their own data models, web services, and data types. Combining data across databases leads to new insights, but is not easy because each database uses its own system of identifiers. In the absence of stable and interoperable identifiers, databases are often linked using taxonomic names. This labor intensive, error prone, and lengthy process relies on accessible versions of nomenclatural authorities and fuzzy-matching algorithms. To approach the challenge of linking diverse data, more than technology is needed. New social collaborations like the Global Unified Open Data Architecture (GUODA) that combines skills from diverse groups of computer engineers from iDigBio, server resources from the Advanced Computing and Information Systems (ACIS) Lab, global-scale data presentation from EOL, and independent developers and researchers are what is needed to make concrete progress on finding relationships between biodiversity datasets. This paper will discuss a technical solution developed by the GUODA collaboration for faster linking across databases with a use case linking Wikidata and the Global Biotic Interactions database (GloBI). The GUODA infrastructure is a 12-node, high performance computing cluster made up of about 192 threads with 12 TB of storage and 288 GB memory. Using GUODA, 20 GB of compressed JSON from Wikidata was processed and linked to GloBI in about 10–11 min. Instead of comparing name strings or relying on a single identifier, Wikidata and GloBI were linked by comparing graphs of biodiversity identifiers external to each system. This method resulted in adding 119,957 Wikidata links in GloBI, an increase of 13.7% of all outgoing name links in GloBI. Wikidata and GloBI were compared to Open Tree of Life Reference Taxonomy to examine consistency and coverage. 
The process of parsing Wikidata, Open Tree of Life Reference Taxonomy and GloBI archives and calculating consistency metrics was done in minutes on the GUODA platform. As a model collaboration, GUODA has the potential to revolutionize biodiversity science by bringing diverse technically minded people together with high performance computing resources that are accessible from a laptop or desktop. However, participating in such a collaboration still requires basic programming skills.
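The identifier-based linking described in this abstract can be illustrated with a small sketch. This is not the GUODA implementation, which ran on a computing cluster; it is a simplified, one-hop version of the idea, assuming each record carries a set of external identifiers. The function name, record ids and identifier values below are all hypothetical.

```python
from collections import defaultdict


def link_by_shared_identifiers(source_a, source_b):
    """Pair records from two sources that share at least one external identifier.

    source_a, source_b: dict mapping a local record id to a set of external
    identifiers (for example "NCBI:2"). Returns a sorted list of
    (id_a, id_b) pairs.
    """
    # Index source_b by external identifier so each lookup is a dict access.
    index = defaultdict(set)
    for id_b, external_ids in source_b.items():
        for ext in external_ids:
            index[ext].add(id_b)

    # Two records are linked when any external identifier appears in both.
    links = set()
    for id_a, external_ids in source_a.items():
        for ext in external_ids:
            for id_b in index.get(ext, ()):
                links.add((id_a, id_b))
    return sorted(links)


# Hypothetical sample records; the identifiers are made up for illustration.
wikidata_like = {
    "Q-A": {"NCBI:1", "ITIS:10"},
    "Q-B": {"NCBI:2"},
    "Q-C": {"GBIF:99"},
}
globi_like = {
    "taxon-1": {"ITIS:10", "EOL:500"},
    "taxon-2": {"NCBI:2", "GBIF:7"},
    "taxon-3": {"WORMS:3"},
}

links = link_by_shared_identifiers(wikidata_like, globi_like)
print(links)
```

Matching on shared external identifiers avoids comparing name strings, so no fuzzy matching is involved; records with no identifier in common (here "Q-C" and "taxon-3") simply stay unlinked.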

    WorldFAIR Project (D10.1) Agriculture-related pollinator data standards use cases report

    Although pollination is an essential ecosystem service that sustains life on Earth, data on this vital process is largely scattered or unavailable, limiting our understanding of the current state of pollinators and hindering effective actions for their conservation and sustainable management. In addition to the well-known challenges of biodiversity data management, such as taxonomic accuracy, the recording of biotic interactions like pollination presents further difficulties in proper representation and sharing. Currently, the widely-used standard for representing biodiversity data, Darwin Core, lacks properties that allow for adequately handling biotic interaction data, and there is a need for FAIR vocabularies for properly representing plant-pollinator interactions. Given the importance of mobilising plant-pollinator interaction data also for food production and security, the Research Data Alliance Improving Global Agricultural Data Community of Practice has brought together partners from representative groups to address the challenges of advancing interoperability and mobilising plant-pollinator data for reuse. This report presents an overview of projects, good practices, tools, and examples for creating, managing and sharing data related to plant-pollinator interactions, along with a work plan for conducting pilots in the next phase of the project. We present the main existing data indexing systems and aggregators for plant-pollinator interaction data, as well as citizen science and community-based sourcing initiatives. We also describe current challenges for taxonomic knowledge and present two data models and one semantic tool that will be explored in the next phase. In preparation for the next phase, which will provide best practices and FAIR-aligned guidelines for documenting and sharing plant-pollinator interactions based on pilot efforts with data, this Case Study comprehensively examined the methods and platforms used to create and share such data. 
By understanding the nature of data from various sources and authors, the alignment of the retrieved datasets with the FAIR principles was also taken into consideration. We discovered that a large amount of data on plant-pollinator interaction is made available as supplementary files of research articles in a diversity of formats and that there are opportunities for improving current practices for data mobilisation in this domain. The diversity of approaches and the absence of appropriate data vocabularies cause confusion, information loss, and the need for complex data interpretation and transformation. Our explorations and analyses provided valuable insights for structuring the next phase of the project, including the selection of the pilot use cases and the development of a ‘FAIR best practices’ guide for sharing plant-pollinator interaction data. This work primarily focuses on enhancing the interoperability of data on plant-pollinator interactions, envisioning its connection with the effort WorldFAIR is undertaking to develop a Cross-Domain Interoperability Framework. Visit WorldFAIR online at http://worldfair-project.eu. WorldFAIR is funded by the EC HORIZON-WIDERA-2021-ERA-01-41 Coordination and Support Action under Grant Agreement No. 101058393.

    Liberating host–virus knowledge from biological dark data

    Connecting basic data about bats and other potential hosts of SARS-CoV-2 with their ecological context is crucial to the understanding of the emergence and spread of the virus. However, when lockdowns in many countries started in March, 2020, the world's bat experts were locked out of their research laboratories, which in turn impeded access to large volumes of offline ecological and taxonomic data. Pandemic lockdowns have brought to attention the long-standing problem of so-called biological dark data: data that are published, but disconnected from digital knowledge resources and thus unavailable for high-throughput analysis. Knowledge of host-to-virus ecological interactions will be biased until this challenge is addressed. In this Viewpoint, we outline two viable solutions: first, in the short term, to interconnect published data about host organisms, viruses, and other pathogens; and second, to shift the publishing framework beyond unstructured text (the so-called PDF prison) to labelled networks of digital knowledge. As the indexing system for biodiversity data, biological taxonomy is foundational to both solutions. Building digitally connected knowledge graphs of host–pathogen interactions will establish the agility needed to quickly identify reservoir hosts of novel zoonoses, allow for more robust predictions of emergence, and thereby strengthen human and planetary health systems.

    Open Science Principles for Accelerating Trait-Based Science Across the Tree of Life

    Synthesizing trait observations and knowledge across the Tree of Life remains a grand challenge for biodiversity science. Species traits are widely used in ecological and evolutionary science, and new data and methods have proliferated rapidly. Yet accessing and integrating disparate data sources remains a considerable challenge, slowing progress toward a global synthesis to integrate trait data across organisms. Trait science needs a vision for achieving global integration across all organisms. Here, we outline how the adoption of key Open Science principles—open data, open source and open methods—is transforming trait science, increasing transparency, democratizing access and accelerating global synthesis. To enhance widespread adoption of these principles, we introduce the Open Traits Network (OTN), a global, decentralized community welcoming all researchers and institutions pursuing the collaborative goal of standardizing and integrating trait data across organisms. We demonstrate how adherence to Open Science principles is key to the OTN community and outline five activities that can accelerate the synthesis of trait data across the Tree of Life, thereby facilitating rapid advances to address scientific inquiries and environmental issues. Lessons learned along the path to a global synthesis of trait data will provide a framework for addressing similarly complex data science and informatics challenges.

    The Bari Manifesto : An interoperability framework for essential biodiversity variables

    Essential Biodiversity Variables (EBV) are fundamental variables that can be used for assessing biodiversity change over time, for determining adherence to biodiversity policy, for monitoring progress towards sustainable development goals, and for tracking biodiversity responses to disturbances and management interventions. Data from observations or models that provide measured or estimated EBV values, which we refer to as EBV data products, can help to capture the above processes and trends and can serve as a coherent framework for documenting trends in biodiversity. Using primary biodiversity records and other raw data as sources to produce EBV data products depends on cooperation and interoperability among multiple stakeholders, including those collecting and mobilising data for EBVs and those producing, publishing and preserving EBV data products. Here, we encapsulate ten principles for the current best practice in EBV-focused biodiversity informatics as 'The Bari Manifesto', serving as implementation guidelines for data and research infrastructure providers to support the emerging EBV operational framework based on trans-national and cross-infrastructure scientific workflows. The principles provide guidance on how to contribute towards the production of EBV data products that are globally oriented, while remaining appropriate to the producer's own mission, vision and goals. These ten principles cover: data management planning; data structure; metadata; services; data quality; workflows; provenance; ontologies/vocabularies; data preservation; and accessibility. For each principle, desired outcomes and goals have been formulated. Some specific actions related to fulfilling the Bari Manifesto principles are highlighted in the context of each of four groups of organizations contributing to enabling data interoperability - data standards bodies, research data infrastructures, the pertinent research communities, and funders. 
The Bari Manifesto provides a roadmap enabling support for routine generation of EBV data products, and increases the likelihood of success for a global EBV framework.

    globalbioticinteractions/elton 0.3.11

    Features: improvements related to https://github.com/jhpoelen/eol-globi-data/releases/tag/v0.9.4 . Bug fixes.

    globalbioticinteractions/elton 0.3.13

    Features: improvements related to https://github.com/jhpoelen/eol-globi-data/releases/tag/v0.9.6 . Bug fixes.

    Global Biotic Interactions: A Catalyst For Integrating Existing Interaction Datasets, Connecting Data Curators And Developing Data Exchange Methods

    Since 2013, Global Biotic Interactions (GloBI, globalbioticinteractions.org, Poelen et al. 2014) has taken an opportunistic, decentralized approach to integrating, and making accessible, existing species interaction datasets. Rather than expecting dataset curators to conform to some publication regime, methods were developed to automatically and algorithmically discover, parse and link existing datasets without the need to reformat, relocate, or transfer ownership of the existing dataset. The automated nature of GloBI helps to: (a) automate propagation of dataset updates, (b) quickly detect data integration issues (e.g. outage, change in data format), (c) integrate new datasets without having to contact some central office, (d) avoid permanent data loss due to software integration bugs, and, last but not least, (e) retain access to datasets even after GloBI goes away. As far back as 1927, Charles Elton, a founding father of modern ecology, realized the importance of linking natural history knowledge stored in professional journals while acknowledging the value of local (amateur) knowledge. Despite technological advances, details on how species interact are still largely available only by studying professional journals, manually inspecting datasets or striking up a conversation with an ecologist, farmer or citizen scientist. The lack of access to species interaction data is known as the Eltonian shortfall (Hortal et al. 2015). GloBI’s mission is to address this shortcoming. By borrowing from software engineering practices such as test driven development and continuous integration, re-purposing freely available platforms such as GitHub, Zenodo, Travis CI and integrating with many existing biodiversity services (e.g. globalnames.org, eol.org, crossref.org, geonames.org), GloBI has grown to include about 2.8M interaction records spanning 100k taxa (source: globalbioticinteractions.org/references, 17 July 2017) and has established bi-directional links to projects including, but not limited to, the NCBI Taxonomy, World Register of Marine Species, Encyclopedia of Life and iNaturalist. As GloBI continues to link existing species interaction datasets, and form a loosely affiliated community of data curators, educators and (citizen) scientists, the data integration platform is well-suited to play an active and experimental role in the development of novel methods to more easily mobilize and integrate species interaction data in an effort to realize Charles Elton's dream to "[...] provide conceptions which can link up into some complete scheme the colossal store of facts about natural history which has accumulated up to date in this rather haphazard manner. [...]" (Elton 1927).